A Framework of Online Learning with Imbalanced Streaming Data

Authors

  • Yan Yan
  • Tianbao Yang
  • Yi Yang
  • Jianhui Chen
Abstract

A challenge in mining large-scale streaming data that most existing studies on online learning overlook is the skewed distribution of examples across classes. Many previous works have considered cost-sensitive approaches in an online setting for streaming data, where fixed costs are assigned to different classes, or ad hoc costs are adapted based on the distribution of data received so far. However, these approaches do not necessarily achieve optimal performance in terms of the measures suited to imbalanced data, such as F-measure, area under the ROC curve (AUROC), and area under the precision-recall curve (AUPRC). This work proposes a general framework for online learning with imbalanced streaming data, where examples arrive sequentially and models are updated on the fly. By simultaneously learning multiple classifiers with different cost vectors, the proposed method can be adapted to different target measures for imbalanced data, including F-measure, AUROC, and AUPRC. Moreover, we present a rigorous theoretical justification of the proposed framework for F-measure maximization. Our empirical studies demonstrate that the proposed method performs competitively with, if not better than, previous cost-sensitive and resampling-based online learning algorithms and those designed to optimize specific measures.
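The idea of training several cost-sensitive classifiers side by side on one stream can be illustrated with a minimal sketch. This is not the authors' algorithm: the class `CostSensitiveLearner`, the function `run_stream`, the cost grid, the logistic loss, the synthetic Gaussian stream, and the greedy choice of the learner with the highest running F1 are all assumptions made here for illustration only.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


class CostSensitiveLearner:
    """Online logistic regression with a fixed cost on positive-class errors."""

    def __init__(self, dim, pos_cost, lr=0.1):
        self.w = [0.0] * dim
        self.b = 0.0
        self.pos_cost = pos_cost  # hypothetical cost in (0, 1); negatives get 1 - pos_cost
        self.lr = lr
        self.tp = self.fp = self.fn = 0  # running counts used for F1

    def score(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x)) + self.b

    def step(self, x, y):
        """Predict first, record the outcome, then take one gradient step."""
        z = self.score(x)
        pred = 1 if z >= 0 else 0
        if pred == 1 and y == 1:
            self.tp += 1
        elif pred == 1 and y == 0:
            self.fp += 1
        elif pred == 0 and y == 1:
            self.fn += 1
        cost = self.pos_cost if y == 1 else 1.0 - self.pos_cost
        g = cost * (sigmoid(z) - y)  # gradient of the weighted log loss w.r.t. z
        self.w = [wi - self.lr * g * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * g

    def f1(self):
        denom = 2 * self.tp + self.fp + self.fn
        return 2 * self.tp / denom if denom else 0.0


def run_stream(n, seed=0, pos_rate=0.05):
    """Feed one synthetic imbalanced stream to learners with different costs."""
    rng = random.Random(seed)
    learners = [CostSensitiveLearner(2, c) for c in (0.5, 0.6, 0.7, 0.8, 0.9)]
    for _ in range(n):
        y = 1 if rng.random() < pos_rate else 0
        mu = 2.0 if y == 1 else 0.0  # assumed class means for the toy data
        x = [rng.gauss(mu, 1.0), rng.gauss(mu, 1.0)]
        for m in learners:
            m.step(x, y)  # every learner sees the same example
    return learners


learners = run_stream(2000)
best = max(learners, key=lambda m: m.f1())  # greedy pick by running F1
for m in learners:
    print(f"pos_cost={m.pos_cost:.1f}  running F1={m.f1():.3f}")
print("selected pos_cost:", best.pos_cost)
```

Each learner is evaluated on an example before it is updated, so the running F1 reflects predictions made on unseen data. Picking the learner with the highest running F1 is only one possible selection rule; the abstract does not specify how the framework chooses among classifiers, and a different rule would be needed to target AUROC or AUPRC.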


Similar resources

Online Streaming Feature Selection Using Geometric Series of the Adjacency Matrix of Features

Feature Selection (FS) is an important pre-processing step in machine learning and data mining. All the traditional feature selection methods assume that the entire feature space is available from the beginning. However, online streaming features (OSF) are an integral part of many real-world applications. In OSF, the number of training examples is fixed while the number of features grows with t...


Online Imbalanced Learning with Kernels

Imbalanced learning, or learning from imbalanced data, is a challenging problem in both academia and industry. Streaming imbalanced data have become common and raise volume, velocity, and variety issues for learning from such data. To tackle these issues, online learning algorithms have been proposed to learn a linear classifier by maximizing the AUC score. However, the developed line...


Learning to Classify Data Streams with Imbalanced Class Distributions

Streaming data is pervasive in a multitude of data mining applications. One fundamental problem in the task of mining streaming data is distributional drift over time. Streams may also exhibit high and varying degrees of class imbalance, which can further complicate the task. In scenarios like these, class imbalance is particularly difficult to overcome and has not been as thoroughly studied. I...


Kernelized Online Imbalanced Learning with Fixed Budgets

Online learning from imbalanced streaming data to capture the nonlinearity and heterogeneity of the data is significant in machine learning and data mining. To tackle this problem, we propose a kernelized online imbalanced learning (KOIL) algorithm to directly maximize the area under the ROC curve (AUC). We address two more challenges: 1) How to control the number of support vectors without sac...


Enhancing Learning from Imbalanced Classes via Data Preprocessing: A Data-Driven Application in Metabolomics Data Mining

This paper presents a data mining application in metabolomics. It aims at building an enhanced machine learning classifier that can be used for diagnosing cachexia syndrome and identifying its involved biomarkers. To achieve this goal, a data-driven analysis is carried out using a public dataset consisting of 1H-NMR metabolite profile. This dataset suffers from the problem of imbalanced classes...




Publication date: 2017